NMAR

NMAR: An R Package for Estimation under Nonignorable Nonresponse in Sample Surveys

Tackling Nonignorable Missingness in Official Surveys with R


National Science Centre grant Poland (OPUS 20 grant no. 2020/39/B/HS4/00941)


dr Maciej Beręsewicz (Poznan University of Business and Economics),
Igor Kołodziej (WUT - Faculty of Mathematics),
Mateusz Iwaniuk (WUT - Faculty of Mathematics)


NMAR - Not Missing At Random

Logo

Motivation: The Challenge of Nonignorable Nonresponse (NMAR)

  1. Missing Data Structure: We are interesting in estimating the mean of \(Y\) which is subject to missingness. Each observation \(y_i\) has its corresponding respone indicator \(\delta_i\):

\[ \delta_i = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{if } y_i \text{ is missing} \end{cases} \]

  1. Outcome Mechanism: Having a set of auxiliary covariates \(X_1\), we model

\[\mathbb{P}(Y \mid X_1) \text{, or nonparametric } F(Y, X_1)\]

  1. Response Mechanism: The missingness of \(Y\) depends on \(Y\) itself, and perhaps on other covariates \(X_2\)

\[P(\delta=1\mid Y, X_2) \neq P(\delta=1\mid X_2)\]

Note: The sets of auxiliary and response model covariates may or may not overlap depending the method.

Practical Example: Salary Survey

Scenario: Richer people tend not to answer questions about salary

\(Y\): Salary (with missing values)
\(\delta\): Response indicator (1 = answered, 0 = refused)


Outcome Mechanism :

\(Y \sim \text{experience} + \text{education}\)


Response Mechanism:

Assuming that apart from salary, it is also affected by age

\(\delta \sim Y + \text{age}\)

The Case of the Missing Salaries

Simulation Results

NMAR estimators are closer to true population mean compared to MAR estimators, naive sample mean is badly biased.

Exponential Tilting - Riddles et al (2016) (1/2)

Assumptions

  • Estimate density of observed data: \(f(y|X_1, \delta = 1)\), e.g., salary \(\sim\) experience + education
  • Use response model (logit/probit) to estimate \(\pi(x_{1i}, y_i; \phi) = P(\delta_i = 1 | X_{2i}, y_i)\) based on \(y\) (e.g. salary), \(X_2\) (e.g., age), and intercept

1. Estimate Observed Distribution

\[ \hat{f}_1(y) = f(y|X_1, \delta = 1; \gamma) \]

2. Define score function

\[ S_{\text{obs}} = \sum_{j:\delta_j=1} s(y_j, X_{1j}) = \sum_{j:\delta_j=1} \left[ \frac{\delta_j - \pi(X_{1j}, y_j; \phi)}{\pi(X_{1j}, y_j; \phi)\{1-\pi(X_{1j}, y_j; \phi)\}} \frac{\partial \pi(X_{1j}, y_j; \phi)}{\partial \phi} \right] \]

\[ S_{\text{mis}} = \sum_{j:\delta_j=1} \sum_{i:\delta_i=0} w_{ji} \cdot s(y_j, X_{1i}) = \sum_{j:\delta_j=1} \sum_{i:\delta_i=0} w_{ji} \cdot \left[ \frac{\delta_i - \pi(X_{1i}, y_j; \phi)}{\pi(X_{1i}, y_j; \phi)\{1-\pi(X_{1i}, y_j; \phi)\}} \frac{\partial \pi(X_{1i}, y_j; \phi)}{\partial \phi} \right] \]

Exponential Tilting - Riddles et al (2016) (2/2)

3. Calculate weights

\[ w_{ji}^*(\phi, \gamma) = \frac{\frac{O(\mathbf{x}_{1i}, y_j; \phi) f_1(y_j|\mathbf{x}_i; \gamma)}{\sum_m f_m(y_j|\mathbf{x}_i; \gamma)}}{\sum_{k:\delta_k=1} \frac{O(\mathbf{x}_{1i}, y_k; \phi) f_1(y_k|\mathbf{x}_i; \gamma)}{\sum_m f_m(y_k|\mathbf{x}_i; \gamma)}} \]

4. Solve Final Equation

\[ S_{\text{total}} = S_{\text{obs}} + S_{\text{mis}} = 0 \]

5. Estimate Population Mean

\[ \mathbb{E}[Y] = \hat{\theta} = \frac{\sum_{i:\delta_i=1} \frac{1}{\pi_i(\hat{\phi}_p)} y_i}{\sum_{i:\delta_i=1} \frac{1}{\pi_i(\hat{\phi}_p)}} \]

Empirical Likelihood: Core Idea (1/2)

Based on Qin, Leung & Shao (JASA, 2002). Model only the response mechanism; keep the outcome distribution nonparametric.

\[ w(y, x; \beta) = P(\delta = 1 \mid Y=y, X=x) \quad (\text{e.g., logit/probit}), \qquad W = \iint w(y, x; \beta) \, dF(y, x). \]

Profile out \(F\) (for fixed \((\beta, W)\)) by placing masses \(p_i\) on respondents \((y_i, x_i)\) and enforcing:

  • \(\sum p_i = 1\) (probability)
  • \(\sum p_i \{ w(y_i, x_i; \beta) - W \} = 0\) (response rate)
  • \(\sum p_i (x_i - \mu_x) = 0\) (auxiliaries, if available)

This yields the EL weights

\[ p_i = \frac{1}{n} \frac{1}{1 + \lambda_W (w(y_i, x_i; \beta) - W) + (x_i - \mu_x)^\top \lambda_x}. \]

Empirical Likelihood: Mean Estimate (2/2)

\[ \hat{\mu}_Y = \sum_{i:\delta_i=1} p_i y_i \]

How are the weights obtained?

Profile out \(F\) and solve the estimating equations for \((\beta, W, \lambda_W, \lambda_x)\), plug back into the formula for \(p_i\).


Survey sampling?

Can be extended to survey sampling by incorporating the design weights into the likelihood and augumenting auxiliary constraints with stratum indicators.


Properties

  • Consistent if the response model \(w(y, x; \beta)\) is correctly specified.
  • Asymptotically normal (under regularity conditions).
  • Auxiliaries improve efficiency.

NMAR Package in Action

exptilt_config <- exptilt_engine(
    standardize = FALSE, #Common
    on_failure = "return", # Common
    variance_method = "bootstrap", #Common
    bootstrap_reps = 20, # Common
    control = list(), # Common
    family = "logit", # Common
    y_dens = "normal",
    stopping_threshold = 1
)
formula = Y ~ x1 + x2
res <- nmar(formula = formula, data = x, engine = exptilt_config, trace_level=0)
print(res)
NMAR Result
------------
Y mean: -1.048245 (0.049008)
Converged: TRUE 
Variance method: bootstrap 
Estimator: exponential_tilting 
coef(res)
(Intercept)           Y 
  0.5229152  -0.2465591 
True Y mean:           -1.0408 
Est Y mean (NMAR):     -1.0482   3σ interval: ( -1.1218 ,  -0.9747 σ= 0.0490 )
Naive Y mean (MAR):    -1.1842 

Improving R folder

Simulation Results
Simulation Results

CI/CD via Github Actions

Simulation Results

Simulation Results

Future Work and Challenges

  • Large-scale data handling
  • Analytical variance calculation
  • Extension to new estimation approaches
  • Extending estimation parameters beyond mean (quantiles, gini index)
  • Rewriting Exptilt EM and Likelihood Jacobians in C/C++ for improved performance

References

  • Qin, J., Leung, D., & Shao, J. (2002). Estimation With Survey Data Under Nonignorable Nonresponse or Informative Sampling. Journal of the American Statistical Association, 97(457), 193–200. https://doi.org/10.1198/016214502753479338
  • Minsun Kim Riddles, Jae Kwang Kim, Jongho Im, A Propensity-score-adjustment Method for Nonignorable Nonresponse, Journal of Survey Statistics and Methodology, Volume 4, Issue 2, June 2016, Pages 215–245, https://doi.org/10.1093/jssam/smv047